Latent Features of Numbers Learned by Sequence Models

by Peter de Blanc + ChatGPT Deep Research
Posted to Adarie (www.adarie.com) on April 16, 2025
Content License: Creative Commons CC0 (No Rights Reserved)


Researchers have developed various ways to embed integers as distinct tokens in sequence modeling tasks (e.g. using OEIS data). In these approaches, each number is treated like a “word” with its own vector representation, allowing models to learn mathematical relationships from the contexts in which numbers appear. Below we summarize the key techniques for learning such embeddings, how the infinite vocabulary of integers is managed, and what semantic structure emerges in the learned vector spaces.

Methods for Learning Number Embeddings as Tokens
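One representative setup, sketched below, is to train word2vec-style skip-gram embeddings directly on integer sequences, with every integer treated as its own token. This is only a minimal illustration of the general recipe described above; the corpus file name and hyperparameters are assumptions for the sketch, not details taken from any particular paper.

```python
# Minimal sketch: skip-gram embeddings over integer-token sequences.
# Assumptions: gensim is installed, and "oeis_sequences.txt" (hypothetical)
# holds one sequence per line with terms separated by spaces.
from gensim.models import Word2Vec

with open("oeis_sequences.txt") as f:
    sequences = [line.split() for line in f]   # each integer term is a distinct token

model = Word2Vec(
    sentences=sequences,
    vector_size=100,   # dimensionality of each number's embedding
    window=5,          # how many neighboring terms count as context
    min_count=5,       # drop very rare numbers (see the vocabulary section below)
    sg=1,              # skip-gram objective
)

# Numbers that appear in similar sequence contexts end up with similar vectors.
print(model.wv.most_similar("16", topn=5))
```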

Vocabulary Size and Handling of Infinite Integers

A core challenge in treating numbers as tokens is the infinite vocabulary of integers. In practice, researchers impose limits so that only certain numbers receive dedicated embeddings.

Projects that embed numbers as tokens therefore choose a cutoff that suits their data and goals. In OEIS-based experiments, assigning an embedding to every number that appears with at least moderate frequency (and mapping everything rarer to an UNK token) has proven effective. For other corpora, one might keep only the top K most frequent numbers (for instance, in a scientific corpus where certain constants recur) and treat the rest as unknown, or fall back to digit-level decomposition. The overarching aim is to balance coverage (having embeddings for the numbers that matter in the domain) against generalization (not overfitting to rare tokens, and handling unseen numbers reasonably).
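A minimal sketch of this frequency-cutoff scheme follows; the threshold value and the <unk> token name are illustrative assumptions.

```python
# Minimal sketch of a frequency-cutoff vocabulary with an UNK fallback.
from collections import Counter

def build_vocab(sequences, min_count=5, unk_token="<unk>"):
    """Give a dedicated id only to numbers seen at least `min_count` times."""
    counts = Counter(tok for seq in sequences for tok in seq)
    vocab = {unk_token: 0}
    for tok, freq in counts.most_common():
        if freq >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(seq, vocab, unk_token="<unk>"):
    """Map each number to its id, falling back to UNK for rare or unseen ones."""
    return [vocab.get(tok, vocab[unk_token]) for tok in seq]

toy = [["1", "1", "2", "3", "5", "8"], ["2", "4", "8", "16"]]
vocab = build_vocab(toy, min_count=2)
print(encode(["8", "16", "99991"], vocab))  # "16" and "99991" fall back to <unk>
```

A digit-decomposition fallback would instead split an out-of-vocabulary number into digit tokens rather than collapsing it to UNK.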

Semantic Structure in Learned Number Embeddings

Even though these embeddings are learned without explicit labels for number properties, researchers have found that certain human-interpretable numerical concepts emerge as directions or clusters in the vector space. Probing and visualizing number embeddings have revealed axes that correspond to fundamental properties such as parity, divisibility, primality, and magnitude.
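As a sketch of how such probing works, a linear classifier can be fit to predict a property (here, parity) from the embedding vectors. The random stand-in vectors below exist only so the snippet runs on its own and should be replaced with learned embeddings.

```python
# Minimal probing sketch: is parity linearly decodable from the embeddings?
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
numbers = list(range(1, 1001))
# Stand-in vectors; substitute the learned number embeddings here.
number_vecs = {n: rng.normal(size=100) for n in numbers}

X = np.stack([number_vecs[n] for n in numbers])
y = [n % 2 for n in numbers]   # parity labels; primality or divisibility work the same way

probe = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
print("held-out parity accuracy:", probe.score(X[800:], y[800:]))
# High accuracy on real embeddings suggests parity corresponds to a linear
# direction in the space; on the random stand-ins it stays near 50%.
```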

In summary, when integers are embedded as individual tokens based on sequence co-occurrence, the resulting vector space is rich in latent numeric knowledge. Basic number-theoretic properties (parity, divisibility, primality) end up encoded either along single dimensions or in linear combinations of dimensions. Quantitative attributes like magnitude can also emerge, especially if the training data emphasizes them. Perhaps most interestingly, the embedding space captures semantic groupings of numbers: those that belong to the same well-defined sequence or category are located near each other, making it possible to retrieve, say, other primes or other square numbers by vector proximity.

This kind of structure can be exploited for tasks like sequence classification or completion. In fact, the motivation behind learning these embeddings is often to improve performance on downstream tasks in mathematical AI. For example, Ryskina et al. showed that using OEIS-trained number embeddings significantly improved a model’s ability to complete integer sequences and solve analogy questions compared to using off-the-shelf word embeddings. The learned vectors encode information that a model can leverage to guess the next term of a sequence or identify what property a sequence might obey.
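The nearest-neighbor retrieval mentioned above can be sketched with cosine similarity over the embedding table; again the random stand-in vectors are only there to make the example self-contained.

```python
# Minimal sketch of retrieval by vector proximity (cosine similarity).
import numpy as np

rng = np.random.default_rng(0)
number_vecs = {n: rng.normal(size=100) for n in range(1, 1001)}  # stand-in; use learned vectors

def nearest(query, vecs, k=5):
    """Return the k numbers whose embeddings are most similar to `query`'s."""
    q = vecs[query] / np.linalg.norm(vecs[query])
    scores = {n: float(v @ q) / np.linalg.norm(v) for n, v in vecs.items() if n != query}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# With embeddings trained on OEIS-style data, the neighbors of a prime such as
# 97 would be expected to include other primes; with random stand-ins they are arbitrary.
print(nearest(97, number_vecs))
```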
